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Two methods are presented with which the CPU time spent on the calculation of radiative corrections can 
be significantly reduced. The first is the parallelization of the program, which can be surprisingly simple to 
implement under certain circumstances often met in the calculation of radiative corrections. The second is the 
efficient direct calculation of fermion chains. The latter not only improves the overall performance of the program, 
but introduces better conceptual clarity as well, as it allows for a homogeneous treatment of bosonic and fermionic 
amplitudes. 



1. Introduction 

The calculation of radiative corrections is in 
general a very CPU-time consuming business, es- 
pecially if "aggravating conditions" are met, such 
as the scans over large areas in parameter space 
typical for extensions of the Standard Model. 

The running time of a program that computes 
radiative corrections can be written as 

T = Nt + e , (1) 

where N is the number of points in phase and 
parameter space and r is the time to compute 
one such point. The remainder e accounts for all 
code which is executed only n -C N times, like 
initializations of model parameters, etc., and is 
thus generally negligible. Typical numbers are 
N ~ 10 5 . . . 10 7 and r ~ 100 ms. 

The two techniques presented in this contribu- 
tion reduce CPU time via both factors: 

• Parallelization of the program reduces exe- 
cution time from O(N) to 0(N/N plocessols ), 

• Directly computing fermion chains (in pro- 
cesses involving external fermions), rather 
than using the standard trace technique, re- 
duces r from 0(n 2 F ) to 0(np), where np is 
the number of fermionic structures. 

The following discussion is limited to numeri- 
cal programs. The examples are written in For- 
tran 77, which is not only still in wide use among 



numerical programmers, but moreover consti- 
tutes a "lowest common denominator" in the 
sense that if a technique can be implemented in 
Fortran 77, it can be implemented in almost any 
language. 

2. Parallelization 

Parallelization in general is a very difficult 
topic, and considerable effort has been spent on 
the development of sophisticated libraries, com- 
pilers, and other tools. In the business of com- 
puting radiative corrections, however, programs 
often contain "naturally" parallel structures, for 
example if the same calculation is performed for 
different values of a parameter, viz. 

do tanBeta = 2, 20 

xsec = CrossSection(tanBeta) 

print *, tanBeta, xsec 
enddo 

Such programs are known as essentially parallel, 
as each cycle of the loop may be executed inde- 
pendently without having to rearrange any code. 
In contrast, a parallelizing compiler will gener- 
ally have to apply advanced techniques such as 
loop interchange and cache tiling before arriving 
at parallelizable code (see e.g. [|lj). 

The parallelization of such loops turns out to be 
surprisingly simple on a symmetric multiprocess- 
ing (SMP) architecture, that is, a single machine 
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with several processors. Driven by demand in ar- 
eas such as commercial Web hosting, SMP ma- 
chines have meanwhile become affordable even for 
cash-strapped institutes and are well supported 
by all major operating systems. 

The two basic operations are fork and wait: 
A fork creates an exact copy of the process that 
calls it; it returns to the child process and the 
child's process id to the parent. The opposite is 
wait, which suspends the caller until one of its 
child processes has terminated. 

Although fork and wait are not strictly part 
of the ANSI Fortran-77 Standard, they are avail- 
able on many Unix systems .[] They belong to a 
library of wrapper functions that provide Fortran 
access to certain functions of the C library. These 
functions are sometimes known also as the "3F" 
functions because they are described in section 
3F of the Unix man pages. 

The implementation is straightforward: The 
parent process executes the main loop and in each 
cycle of the loop forks off a child process to do the 
actual calculation. The concurrent processes are 
automatically sent on the available processors by 
the operating system. 

The reason why the implementation is so sim- 
ple is that the forked processes run completely 
independently. This means that except for screen 
(and possibly file) output there are no resource- 
sharing issues to take care of, like avoiding simul- 
taneous write access to some variable. By the 
same token, there is no simple way (in pure For- 
tranQ) for the child process to communicate its 
results back to the parent. If the results only 
have to be stored in a file, however, such prob- 
lems can simply be solved by opening a unique 
output file for each child process and redirecting 

1 One notable exception is the g77 compiler. It is not dif- 
ficult to work around this, however: set up a short C pro- 
gram, wrapper. c, which contains 

#include <unistd.h> 

#include <sys/wait.h> 

int fork_() { return forkO ; } 

int wait _ (int * status) { return wait (status) ; } 
and add wrapper. c to the g77 command line, viz. 

g77 (options) myprogram.f wrapper. c (libs) 
2 Two possibilities are: 1) Regular Fortran file I/O, but 
with named pipes. 2) Data exchange via Unix domain 
sockets, although this requires some minor mixed-lan- 
guage programming in C. 



standard output to that file if necessary. 
Consider the following example program: 

1 seqnum = 

2 processes = 

3 cpus = 4 

4 

5 do tanBeta = 2, 20 

6 if ( processes .It. cpus ) then 

7 processes = processes + 1 

8 else 

9 pid = wait(0) 

10 endif 

11 seqnum = seqnum + 1 
12 

13 if( forkO .eq. ) then 

14 write (log, ' ( "log . " , 15 . 5) ' ) seqnum 

15 open(6, file=log) 

16 xsec = CrossSection(tanBeta) 

17 print *, tanBeta, xsec 

18 return 

19 endif 

20 enddo 
21 

22 do i = 1 , processes 

23 pid = wait(0) 

24 enddo 

The main do loop (£. 5-20) has been extended 
with respect to the serial version and falls into 
two blocks. The first block (£. 6-11) is executed 
by the parent only. The parent calls fork in £. 13 
and immediately skips to the next cycle of the 
loop. The statements inside the if block (£. 14- 
18) are executed by the child process only. 

The parent takes care not to spawn more child 
processes than there are processors available, so 
as not to overload the system. The number of 
concurrently running child processes is stored in 
the variable processes and the number of pro- 
cessors in cpus. The first cpus cycles of the loop 
"fill up" the available processors with jobs. After 
that the parent waits for a running child process 
to finish before proceeding to the next fork (£. 9). 

The child process, for which fork returns 0, 
branches into the if block. It first generates a 
unique file name using a sequence number (£. 14) 
which it then opens as unit 6, thus redirecting 
terminal output to this file. In this way it is guar- 
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anteed that the output of concurrent processes is 
not mixed up. Most time is spent on the actual 
calculation (£. 16-17) which is identical to the se- 
rial version. Unlike in the serial version, however, 
the child process does not return to the main loop, 
but terminates at the return statement (£. 18). 

After completing the main loop, the parent 
has to wait until all child processes have finished 
{£. 22-24). 

The overhead incurred by a fork is negligible 
unless very short units are parallelized, fork is a 
very common operation on preemptive multitask- 
ing systems such as Unix and is therefore usually 
implemented very efficiently. The Linux kernel, 
for instance, uses copy-on-write memory pages, 
that is, a memory page is not duplicated until ei- 
ther parent or child write on it. In fact, it will 
be difficult to notice any performance penalty of 
the parallel version running on a single CPU com- 
pared to the serial version. 

If care is taken not to overload the system with 
more child processes than there are processors 
available (which would only require unnecessary 
task switches and use up memory resources) it is 
reasonable to expect a near-optimal speed-up, i.e. 
T — > T/N on N processors. 

3. Fermionic Objects 

Amplitudes of processes with external fermions 
can be written as 

M=J2^F Z (2) 

i=l 

where the Fi are (products of) fermion chains, 
i.e. are of the form Y[ {u\ I\ \v), where u and v are 
spinors, and I\ is a product of Dirac matrices. 

One usually proceeds to compute probabilities, 
e.g. |A4| 2 , rather than the amplitude M. itself, 
because one can then use the trace technique: 

\M\ 2 =Y j c* C jF*F j (3) 
where F*Fj is computed as 

= Tr(f\ \u)(u\ Tj \v)(v\). 



The advantage is that no explicit representation 
of the spinors is needed since the projection op- 
erators can be expressed through Dirac matrices 
only, e.g. \u\) (u\\ = |(1 + X'js)^ for a massless 
fermion. 

The problem with the trace technique is that 
it scales as tip. That is, one needs to compute 
all n F combinations of F*Fj and not just rip of 
the F{ as for M alone. For rip = 20, for example, 
one has to compute 400 traces. More severely, 
matters get worse the more vectors are in the 
game, e.g. in multi-particle final states, or with 
polarization effects, because generally all combi- 
nations of vectors can appear in the T; , and thus 
rip ~ (number of vectors)!. 

The obvious alternative to the trace technique 
is to insert the 4-dimensional representation of 
the spinors and the Dirac matrices and work out 
the matrix algebra. This procedure is not entirely 
straightforward to implement in a language like 
Fortran, however, because in general Lorcntz in- 
dices connect different fermion chains. Also, the 
calculational efficiency is not quite optimal be- 
cause the 4-dim. representation contains redun- 
dancy, e.g. entire 2x2 blocks in the Dirac matri- 
ces are zero. 

Actually, the 4-dim. representation is used for 
mathematical convenience only and, from the 
physical point of view, fermions are indeed more 
naturally represented by 2-dim. objects which 
come in two kinds, left- and right-handed. This 
makes the Weyl representation an obvious choice. 

The Weyl representation can in principle be 
used from the very beginning, i.e. already at the 
level of the Feynman rules [ ^| . This is only some- 
what problematic for loop calculations, where 
regularization is needed, since it is not obvious 
how to extend the 2-dim. objects appearing in the 
Feynman rules to D dimensions. In the present 
approach therefore everything is kept 4 (or D) 
dimensional during the algebraic simplification, 
such that e.g. 7 M 7 M — » D can be replaced, and 
the Weyl representation is inserted only at the 
very end, just before the numerical evaluation, 
and only for objects involving external fermions 
(i.e. not for internal fermion traces). 

The Dirac matrices are given in the Weyl rep- 
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resentation by 
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(5) 



where the matrices are defined in terms of the 
ordinary Pauli matrices a = (<7i, a 2 , 03) as 



(t,-a). 



Introducing furthermore 2-dim. spinors via 

(«! = ((«+! ,(«_!), I«> = (|^j), 



(6) 



(7) 



it is a simple exercise to show that every chiral 
4-dim. Dirac chain can be converted to a single 
2-dim. sigma chain: 

(u\ w_7 M 7„ ■■■\v) = (u_|oy7„ ■ ■ ■ \v±) , (8) 
(u\ w+7 M 7 v •••!?;) = (u+| cr^ • • • \v T ) . (9) 

That is, going from the 4-dim. to the 2-dim. rep- 
resentation does not increase tif- 

The two most important identities satisfied by 
the 'ct' matrices are the Fierz identities [ || 

(A\a^\B) (C\a»\D) = 2(A\D) (C\B) , 
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(A\ ( a^ \B) (C\W \D) = 2{A\e \C) {B\ e \D) 
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where e = ( _°i J ) is the spinor metric. 

These identities allow to completely remove all 
Lorentz contractions between sigma chains. For 
the implementation in a computer program this 
is a decisive simplification, for it means that after 



application of the Fierz identities one can calcu- 
late one sigma chain at a time, independent of 
any other sigma chains. 

To make the implementation even simpler, one 
can express also the remaining kinematical ob- 
jects through according to 

9nu = 2 Tr(o-M^) > (I 2 ) 
£\pv P = 2 Tr(erAa>avCTp) — g\pg vp 

+ a\v9w~ 9\p9pv ■ (13) 

Now the four-vectors themselves are no longer 
needed, but only their contractions with ( cf. 



j 1 , 1 k ~f~ k k lk 



k° - k 3 -k 1 + \k' 
-k 1 - ik 2 k° + k 3 



k, (14) 
=: t. 

(15) 



Altogether now, the following objects and op- 
erations need to be implemented: 



• arrays: all spinors, all vectors: 

m - (5) • 



k * k 1 



(c S) 



(16) 
(17) 



• functions: spinor x spinor (ss), matrix x 
spinor (ms), matrix x matrix (mm): 



(u\v) ~ (ill u 2 ) ■ (^J , 



k \v 



(18) 
(19) 

These almost trivial functions are sufficient 
to compute arbitrary sigma chains by re- 
peated application, e.g. 

(u\ht 2 k 3 \v) = (21) 
ss(u, ms(A;i, ms(fc2, ms(/c3, v)))) . 

Note that in order to nest the functions in- 
side each other as above, the matrix-valued 
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functions have to write their results to an 
intermediate array (a kind of accumulator 
register) and return the location of that ar- 
ray. 

Finally, one should note that when the fermion 
chains are computed directly, the amplitude Ai 
becomes a handy complex number, rather than a 
"bunch of form factors," {ciFi}. That is, to com- 
pute for example \M\ 2 , one does not need to sum 
up a list of numbers, but only take the square of 
a complex variable. In this sense fermionic am- 
plitudes are now treated on the same footing as 
bosonic amplitudes. Furthermore, since the Cj no 
longer have to be factored out in front of the F, , 
the analytic expression for M. may be rearranged 
freely to yield the most compact form. 
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4. Summary 

Two methods of speeding up the calculation of 
radiative corrections have been demonstrated. 

Parallelization 

Programs for computing radiative corrections 
are often natural candidates for parallelization. 
It is not difficult to attain near-optimal speed-up, 
i.e. T — > T/N on N processors. 

On SMP machines, it is very easy to parallelize 
Fortran programs with only few changes in the 
code. The two basic functions fork and wait 
are available by default in many Fortran libraries, 
and even if not, only a trivial wrapper program 
in C is needed. 

Direct calculation of fermion chains 

The number of fermionic structures that have 
to be computed can be reduced from 0{n 2 F ), with 
the trace technique, to 0(np) when calculating 
the fermion chains directly. Moreover, with the 
latter approach one gets a simple complex number 
for the amplitude M. 

Using the Weyl representation, the computa- 
tion of the fermion chains is broken down from a 
4-dim. to a 2-dim. problem. The Fierz identities 
allow to completely disentangle fermion chains 
connected by Lorentz indices. 

The implementation is simple, even more so if 
one systematically expresses all kinematic objects 



